ABSTRACT
This paper presents an open-domain question answering (QA) system for Romanian that answers COVID-19-related questions. The QA pipeline comprises automatic question processing, automatic query generation, a web search for the 10 most relevant documents, and answer extraction with a BERT model fine-tuned for extractive QA on a COVID-19 data set that we created manually. We describe the QA system and its integration with RELATE, the Romanian language technologies portal, as well as the COVID-19 data set and several evaluations of QA performance. © 2022, Institute for Bulgarian Language. All rights reserved.
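The pipeline stages named in this abstract (question processing, query generation, top-10 retrieval, extractive answer selection) can be sketched as follows. This is only an illustrative outline, not the authors' implementation: `build_query`, `answer_question`, `Answer`, the tiny `STOPWORDS` list, and the `search` and `reader` callables are all hypothetical names, and the stop-word filter is an assumed stand-in for whatever question processing the system actually performs. The `reader` parameter is where a fine-tuned extractive QA model would be plugged in.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class Answer(NamedTuple):
    text: str     # extracted answer span
    score: float  # reader confidence for the span
    source: str   # URL of the document the span came from


# Tiny illustrative stop-word list (an assumption, not taken from the paper).
STOPWORDS = {"ce", "este", "care", "sunt", "la", "de", "cum", "si", "in"}


def build_query(question: str) -> str:
    """Turn a question into a keyword query by dropping punctuation and stop words."""
    tokens = [t.strip("?.,!").lower() for t in question.split()]
    return " ".join(t for t in tokens if t and t not in STOPWORDS)


def answer_question(
    question: str,
    search: Callable[[str], List[Tuple[str, str]]],   # query -> [(url, text), ...]
    reader: Callable[[str, str], Tuple[str, float]],  # (question, text) -> (span, score)
    top_k: int = 10,
) -> Optional[Answer]:
    """Run query generation, retrieval, and extraction; return the best-scoring span."""
    query = build_query(question)
    documents = search(query)[:top_k]  # keep only the top-k retrieved documents
    candidates = []
    for url, text in documents:
        span, score = reader(question, text)
        if span:  # skip documents where the reader found no answer
            candidates.append(Answer(span, score, url))
    # Highest-confidence span across all documents, or None if there were none.
    return max(candidates, key=lambda a: a.score, default=None)
```

Passing the search engine and the reader as callables keeps the sketch independent of any particular web search API or model library, so each stage can be replaced or evaluated separately.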
ABSTRACT
Automatic speech recognition (ASR) systems that use word-based language models require periodic updates to include new named entities (e.g. coronavirus, COVID-19) and collocations. For Romanian in particular, new hyphenated words pose additional problems. In this context, our study presents SpeeD's efforts to collect new text corpora and use them for language modelling in ASR. We also present the improvements made to the text normalization module to address the problems posed by hyphenated words. We evaluate the resulting language models both in terms of their ability to predict future words (perplexity and out-of-vocabulary rate) and in terms of their usefulness in ASR (word error rate). We report relative ASR improvements of around 10% for spontaneous speech, with small degradations for read speech.
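The evaluation measures named in this abstract follow standard definitions, which the sketch below illustrates. It is not the authors' evaluation code: the function names are hypothetical, the perplexity function assumes natural-log word probabilities supplied by some language model, and the numbers in the usage note are made up for illustration.

```python
import math
from typing import Iterable, List, Set


def perplexity(log_probs: List[float]) -> float:
    """Perplexity from per-word natural-log probabilities: exp(-mean log p)."""
    return math.exp(-sum(log_probs) / len(log_probs))


def oov_rate(tokens: List[str], vocabulary: Set[str]) -> float:
    """Fraction of running tokens that are missing from the vocabulary."""
    if not tokens:
        return 0.0
    missing = sum(1 for t in tokens if t not in vocabulary)
    return missing / len(tokens)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction with respect to the baseline."""
    return (baseline_wer - new_wer) / baseline_wer
```

For example, with a hypothetical baseline WER of 0.20 and a new WER of 0.18, `relative_improvement(0.20, 0.18)` is about 0.10, which is the sense in which a "relative improvement of around 10%" is usually meant.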